XML example and exercise


  • study examples of accessing nodes in XML tree structure
  • work on exercise to be completed and submitted



In [1]:
from xml.etree import ElementTree as ET

XML example


In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text


Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra

In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print '* ' + element.find('name').text + ':',
    capitals_string = ''
    for subelement in element.getiterator('city'):
        capitals_string += subelement.find('name').text + ', '
    print capitals_string[:-2]


* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella

XML exercise

Using data in 'data/mondial_database.xml', the examples above, and refering to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

  1. 10 countries with the lowest infant mortality rates
  2. 10 cities with the largest population
  3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
  4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [5]:
import pandas as pd
import numpy as np

document = ET.parse( './data/mondial_database.xml' )
root = document.getroot()

Question 1


In [6]:
country,mortality = [],[]
for c in root.findall('country'):
    country.append(c.find('name').text)
    try:
        mortality.append(float(c.find('infant_mortality').text))
    except:
        mortality.append(np.nan)

mortalities = pd.concat({'country':pd.Series(country),
                      'mortality':pd.Series(mortality)},axis=1)
mortalities.sort('mortality',ascending=True).head(10)


Out[6]:
country mortality
38 Monaco 1.81
98 Japan 2.13
117 Bermuda 2.48
36 Norway 2.48
106 Singapore 2.53
37 Sweden 2.60
10 Czech Republic 2.63
78 Hong Kong 2.73
79 Macao 3.13
44 Iceland 3.15

Question 2


In [7]:
country, city, population = [], [], []
for c in root.findall('./country/city'):
    country.append(c.attrib['country'])
    city.append(c.find('name').text)
    try:
        population.append(int(c.find('population').text))
    except:
        population.append(np.nan)

cityPops = pd.concat({'country':pd.Series(country),
                      'city':pd.Series(city),
                      'population':pd.Series(population)},axis=1)
cityPops.sort('population',ascending=False).head(10)


Out[7]:
city country population
176 Seoul ROK 10229262
128 Hong Kong HONX 7055071
164 Al Qahirah ET 6053000
80 Bangkok THA 5876000
92 Ho Chi Minh VN 3924435
177 Busan ROK 3813814
216 New Taipei RC 3722082
89 Hanoi VN 3056146
163 Al Iskandariyah ET 2917000
215 Taipei RC 2626138

Question 3

Processing

In [8]:
population, country, year = [], [], []
for c in root:
    current = c.find('name').text
    for pop in c.findall("population"):
        country.append(current)
        year.append(int(pop.attrib['year']))
        try:
            population.append(int(pop.text))
        except:
            population.append(np.nan)
pops = pd.concat({'country':pd.Series(country),
                      'population':pd.Series(population),
                      'year':pd.Series(year)},axis=1)
pops.head(10)


Out[8]:
country population year
0 Albania 1214489 1950
1 Albania 1618829 1960
2 Albania 2138966 1970
3 Albania 2734776 1980
4 Albania 3446882 1990
5 Albania 3249136 1997
6 Albania 3304948 2000
7 Albania 3069275 2001
8 Albania 2800138 2011
9 Greece 1096810 1861

In [9]:
#ref: http://stackoverflow.com/questions/27488080/
pops = pops.groupby('country').apply(lambda g: g[g.year==g.year.max()])
pops = pops.reset_index(drop=True)
pops.head()


Out[9]:
country population year
0 Afghanistan 26023100 2013
1 Albania 2800138 2011
2 Algeria 37062820 2010
3 American Samoa 55519 2010
4 Andorra 78115 2011

In [10]:
country,group,percentage = [], [], []

for c in root:
    current = c.find('name').text
    for pop in c.findall("ethnicgroup"):
        country.append(current)
        group.append(pop.text)
        percentage.append(float(pop.attrib['percentage']))

groups = pd.concat({'country':pd.Series(country),
                      'name':pd.Series(group),
                      'percentage':pd.Series(percentage)},axis=1)

groups.head()


Out[10]:
country name percentage
0 Albania Albanian 95.0
1 Albania Greek 3.0
2 Greece Greek 93.0
3 Macedonia Macedonian 64.2
4 Macedonia Albanian 25.2

In [11]:
combined = pd.merge(groups,pops,on="country")
combined['subpop'] = (combined.percentage*combined.population)//100
combined.head()


Out[11]:
country name percentage population year subpop
0 Albania Albanian 95.0 2800138 2011 2660131
1 Albania Greek 3.0 2800138 2011 84004
2 Greece Greek 93.0 10816286 2011 10059145
3 Macedonia Macedonian 64.2 2059794 2011 1322387
4 Macedonia Albanian 25.2 2059794 2011 519068
Answer:

In [12]:
combined.groupby('name').sum().sort('subpop',ascending=False).subpop.head(10)


Out[12]:
name
Han Chinese    1245058800
Indo-Aryan      871815583
European        494872201
African         318325104
Dravidian       302713744
Mestizo         157734349
Bengali         146776916
Russian         131856989
Japanese        126534212
Malay           121993548
Name: subpop, dtype: float64

Question 4


In [13]:
# a)
country, name, length = [], [], []
for r in root.findall('.//river'):
    country.append(r.attrib['country'])
    name.append(r.find('name').text)
    try:
        length.append(int(r.find('length').text))
    except:
        length.append(np.nan)

rivers = pd.concat({'country':pd.Series(country),
                      'name':pd.Series(name),
                      'length':pd.Series(length)},axis=1)
rivers.sort('length',ascending=False).head(1)


Out[13]:
country length name
174 CO BR PE 6448 Amazonas

In [14]:
# b)
country, name, area = [], [], []
for l in root.findall('.//lake'):
    country.append(l.attrib['country'])
    name.append(l.find('name').text)
    try:
        area.append(int(l.find('area').text))
    except:
        area.append(np.nan)

lakes = pd.concat({'country':pd.Series(country),
                      'name':pd.Series(name),
                      'area':pd.Series(area)},axis=1)
lakes.sort('area',ascending=False).head(1)


Out[14]:
area country name
54 386400 R AZ KAZ IR TM Caspian Sea

In [15]:
# c) (Answer here appears to be a bit out of date)
country, name, elevation = [], [], []
for a in root.findall('.//airport'):
    country.append(a.attrib['country'])
    name.append(a.find('name').text)
    try:
        elevation.append(int(a.find('elevation').text))
    except:
        elevation.append(np.nan)

airports = pd.concat({'country':pd.Series(country),
                      'name':pd.Series(name),
                      'elevation':pd.Series(elevation)},axis=1)
airports.sort('elevation',ascending=False).head(1)


Out[15]:
country elevation name
80 BOL 4063 El Alto Intl